Abstract

In 'no common mechanism' (NCM) models of character evolution, each character can evolve on a 
phylogenetic tree under a partially or totally separate process (e.g. with its own branch lengths). In such 
cases, the usual conditions that suffice to establish the statistical consistency of tree reconstruction by 
methods such as maximum likelihood (ML) break down, suggesting that such methods may be prone to 
statistical inconsistency (SIN). In this paper we ask whether we can avoid SIN for tree topology 
reconstruction when adopting such models, either by using ML or any other method that could be devised. 
We prove that it is possible to avoid SIN for certain NCM models, but not for others, and the results 
depend delicately on the tree reconstruction method employed. We also describe the biological relevance of 
some recent mathematical results for the more usual 'common mechanism' setting. Our results are not 
intended to justify NCM, rather to set in place a framework within which such questions can be formally 
addressed. 
1. Introduction



Statistical Inconsistency (hereafter, SIN) in phylogenetics is the tendency of certain tree reconstruction
methods to converge on an incorrect tree topology when applied to increasing quantities of data that evolve
under a given model. The phenomenon has been well known for simple methods like maximum parsimony
since the landmark paper of Felsenstein (1978) three decades ago. SIN has contributed to the widespread
acceptance of more sophisticated tree reconstruction methods such as maximum likelihood, corrected
distance methods and Bayesian phylogenetics (Felsenstein, 2004; Lemey et al., 2009). These methods
are based explicitly on stochastic models of sequence evolution, for which it is usually possible to
establish statistical consistency when the model assumed by the investigator is also the one that generated
the data (see, for example, Chang (1996); Allman and Rhodes (2006); Sober (2008)).



A centrepiece of nearly all these models is the assumption that character data (for instance, genetic 
sequence sites) evolve independently and identically. This 'i.i.d.' assumption is standard in statistics, and 
implies that each character is described by essentially the same process and that the characters represent a 
finite random sample of this process. This i.i.d. assumption applies even for mainstream models that allow 
a distribution of rates across sites, such as the frequently used 'Gamma+I' embellishment of the general
time reversible (GTR) model. In these models, it is usually assumed that the rate at a site is chosen i.i.d. 
from a given distribution. Such a 'rates-across sites' model is subtly different from a model that assumes 
that each site has its own particular intrinsic rate (i.e. not chosen i.i.d. from some distribution) - let us call 
it a 'variable site rate' model. Within such a model, the sequence sites may still be independently 
generated, but they are not identically distributed (some sites simply are evolving faster than others). 



If we just consider the frequencies of site patterns, then the two models (rates across sites, and variable site 
rate) can produce (almost) identical data; however, significant differences between the models can become 
apparent when we come to do tree reconstruction from given sequences. For example, in a maximum 
likelihood approach to tree reconstruction, in which we explicitly assume the variable site rate model, we
may wish to estimate a corresponding rate for each site that maximizes the probability of observing the
given site pattern - along with a shared underlying set of branch lengths common to all the sites (such an
approach was described by Gary Olsen in Swofford et al. (1996)). Each rate estimate - one for each site -
might later be discarded as a 'nuisance parameter' in the search for the underlying tree topology alone.



This approach is quite different to doing the 'usual' form of maximum likelihood estimation of a tree 
topology under a rates-across-sites model. We can ask if such an approach is statistically sound - in 
particular, can it lead to SIN? What if we allow the branch lengths also to vary from character to character 
(the more usual form of 'no common mechanism')? Is maximum likelihood under this model liable to SIN; 
if so, can any method reconstruct a tree under this model without SIN? These are the sort of questions we 
will address. We will also describe the biological relevance of some recent mathematical results concerning 
tree reconstruction in the more usual 'common mechanism' setting. 



First we outline some of the motivations and concerns surrounding no common mechanism models in 
phylogenetics. We then discuss statistical consistency in a general setting - first for common mechanism 
models, where much is known, then for no-common mechanism models, where there has been little analysis 
to date in phylogenetics. In Theorem 1 we present some first results in this area and show how the details 
of the model (and the method) are crucial to whether we are in danger of SIN when working with a 
no-common mechanism model. We also describe different forms of SIN, and attempts to measure and 
manage it. The paper ends with a brief discussion. 



2. Some reasons for and against 'No common mechanism'? 



The idea that the evolution of characters in biology might be described by different sets of branch lengths
underlies recent attempts to deal with phenomena such as heterotachy (Gaucher and Miyamoto, 2005;
Philippe et al., 2005). However, the idea dates back to the early days of molecular phylogenetics. It is
implicit in Walter Fitch's discussion of a covarion model (Fitch, 1971), and was discussed more explicitly
by Cavender (1981) in reference to his simple two-state Poisson model. In response to the question of
whether the probabilities of change should be the same for all characters, Cavender remarked:
"This assumption can and should be removed. It is unacceptable biologically because it says, for
example, that an insect species is just as likely to lose (or acquire) wings as a spot of color." 



This comment seems reasonable for morphological characters, though even in that setting one might still 
expect some correlation in the relative probabilities of character change on a given branch across 
characters, as it may be more likely to observe changes on branches that correspond to long time intervals 
between speciation. It is less clear that Cavender's comment should apply to aligned DNA sequence sites, 
each of which we might view as a random sample from a common process. Nonetheless, different DNA
sequence sites may be subject to differing selection pressures and the probability that a site mutation 
becomes fixed in a population may depend on structural or functional constraints; for example, whether 
the protein a gene codes for still folds correctly if the substitution changes an amino acid. These 
constraints may vary with time and across the sequence, so enforcing an entirely 'common mechanism' 
model may be too severe. Similar comments apply to other types of genomic data that carry evolutionary 
signal. In linguistics, a model that allows each character to have its own branch lengths has also been
developed for studying language evolution (Warnow et al., 2006).



An additional reason why the No Common Mechanism (hereafter NCM) approach has received further
attention is its relevance to those in the systematics community who advocate the use of maximum
parsimony (hereafter MP) for phylogeny reconstruction (e.g. Farris (2008)). This has been justified by an
equivalence theorem that demonstrates that MP is the maximum likelihood (hereafter ML) estimator of a
tree under a NCM model based on a symmetric Poisson process such as the Jukes-Cantor model
(Tuffley and Steel, 1997). A slightly more streamlined proof of this result has recently been given
(Steel and Penny, 2004; Fischer and Thatte, 2009), and extensions of this equivalence theorem were
described in Steel and Penny (2005), and, most recently, Fischer and Thatte (2009). This last paper also
showed that the original equivalence theorem breaks down if one modifies the Poisson model slightly; either
(i) by imposing a molecular clock, or (ii) by setting an absolute upper bound on the branch lengths.



The significance and implications of the equivalence between MP and ML estimation under NCM have
aroused considerable interest (see, for example, Sober (2004); Farris (2008); Huelsenbeck et al.
(2009, 2008)). One view is that the NCM model is sufficiently general as to capture 'truth' and so should be
the model of choice, thereby providing a justification for maximum parsimony (Farris, 2008). An alternative



position is that NCM is far too parameter rich and it ignores likely correlations between branch lengths 
due to shared time frames of speciation intervals. The NCM model required for the formal equivalence 
between MP and ML under the NCM is also based on a symmetric model of substitution change (such as a 
Jukes-Cantor model). Note that this model predicts (approximately) equal base frequencies; however, a
formal equivalence between MP and ML under the NCM model still holds if one regards the ancestral base 
in the tree at each site as a further parameter to be estimated (this would allow each site to have a 
'preferred' base, to reflect observed base composition in sequences). 



In the sections that follow our aim is not to defend NCM models, but rather to determine which methods, 
if any, would allow phylogenetic tree topology to be estimated in a statistically consistent manner were one 
to adopt various NCM models. 



3. ML estimation in general and in phylogenetics



In this section we consider a general setting that includes phylogenetic tree reconstruction, and other 
problems where a discrete parameter (e.g. a tree, network, cluster) is being estimated from discrete data 
(e.g. DNA sequences, genes) in the presence of unknown additional parameters. 

Suppose we have a sequence of observations $u_1, u_2, \ldots$ taking values in a finite set $U$ (the elements of this
set can be arbitrary, but we will call them 'site patterns' as we will usually be considering aligned DNA 
sequence sites). Suppose that these observations are generated independently by a model M that has a 
fixed but unknown discrete parameter a that takes values in some finite set A, alongside other continuous 
parameters which may vary from observation to observation. In the phylogenetic setting, A will generally 
refer to the set of fully-resolved tree topologies on a given set of species, and the continuous parameters 
may refer to branch lengths or other aspects of the substitution model (site rate, transition/transversion
ratio, shape parameter for a $\Gamma$ distribution of rates across sites etc).

In all such cases, $u_i$ is generated by a pair $(a, \theta_i)$ where $\theta_i$ lies in some set $\Theta(a)$ which we will assume
throughout is an open subset of Euclidean space. In the case of branch lengths on a tree this means they
should be strictly positive but finite real numbers.

3.1. CM and NCM versions of a model. In the Common Mechanism version of a model $M$, which we
will denote by CM-$M$, it is assumed that all the $\theta_i$ values are equal; that is, they take a common value,
$\theta \in \Theta(a)$. By contrast, in the No Common Mechanism version of $M$, which we will denote by NCM-$M$, the
$\theta_i$ can take different values. Notice, however, that if these $\theta_i$ values are assigned randomly and
independently from some common distribution (as is the case with most 'rates across sites' models in
phylogenetics) then this is just a CM version of a slightly more complex model $M^*$ that is derived from $M$.
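
The last remark can be written out explicitly. The following is only a sketch in notation introduced here for illustration (the symbols $\mu$ and $P^{*}$ are assumptions and are not used elsewhere in the paper): if each $\theta_i$ is drawn independently from a common distribution $\mu$ on $\Theta(a)$, then the observations are independent and identically distributed, with site-pattern probabilities obtained by averaging over $\mu$.

```latex
P^{*}[u \mid a, \mu] \;=\; \int_{\Theta(a)} P[u \mid a, \theta] \, d\mu(\theta),
\qquad
P^{*}[u_1, \ldots, u_k \mid a, \mu] \;=\; \prod_{i=1}^{k} P^{*}[u_i \mid a, \mu].
```

In this reading, the parameters describing $\mu$ play the role of the single common continuous parameter of the derived model.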

3.2. Maximum Likelihood under CM and NCM. The ML estimation of a discrete parameter from $A$
under an NCM version of $M$ applied to data $(u_1, \ldots, u_k)$ selects the element $b \in A$ that maximizes

$$L(b \mid \text{data}) := \sup_{(\theta_1, \ldots, \theta_k) \in \Theta(b)^k} P[\text{data} \mid b, (\theta_1, \ldots, \theta_k)] = \prod_{i=1}^{k} \sup_{\theta \in \Theta(b)} P[u_i \mid b, \theta], \tag{1}$$

where 'sup' in (1) refers to supremum (the maximum value either obtained or as a limit). The second
equality in (1) is justified by the assumption that the observations are independently generated by the
model. For ML estimation under the CM version of $M$, the only difference is that the $\theta_i$ values are
required to be identical (i.e. $\theta_i = \theta$ for all $i$).
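
As a minimal numerical sketch of (1), the code below compares the NCM and CM log-likelihoods for the two-state symmetric model on three taxa under a molecular clock (the setting of Example 3.1). The function names, the grid search, and the toy data are assumptions made for illustration only. The suprema are approximated by a maximum over a grid on the closed unit square in $(x, y) = (e^{-2l}, e^{-2L})$, which is legitimate here because the site-pattern probabilities are continuous in $(x, y)$.

```python
import math
from collections import Counter


def pattern_prob(pattern, outgroup, x, y):
    """Probability of a 0/1 site pattern on leaves (0, 1, 2) for the clock tree
    whose outgroup leaf is `outgroup`, where x = exp(-2l) and y = exp(-2L)."""
    s = [1 if state == 0 else -1 for state in pattern]
    j, k = [leaf for leaf in range(3) if leaf != outgroup]
    cherry = s[j] * s[k]                  # agreement of the two cherry leaves
    across = s[outgroup] * (s[j] + s[k])  # agreement of outgroup with the cherry
    return (1 + cherry * x**2 + across * x**2 * y**2) / 8


def grid(n=21):
    """Grid on the closed unit square (hypothetical helper)."""
    pts = [i / (n - 1) for i in range(n)]
    return [(x, y) for x in pts for y in pts]


def ncm_loglik(data, outgroup, points):
    """NCM version: each site gets its own maximizing (x, y)."""
    counts = Counter(data)
    return sum(
        c * math.log(max(pattern_prob(u, outgroup, x, y) for x, y in points))
        for u, c in counts.items()
    )


def cm_loglik(data, outgroup, points):
    """CM version: one (x, y) is shared by all sites."""
    counts = Counter(data)
    best = -math.inf
    for x, y in points:
        probs = [(pattern_prob(u, outgroup, x, y), c) for u, c in counts.items()]
        if all(p > 0 for p, _ in probs):
            best = max(best, sum(c * math.log(p) for p, c in probs))
    return best


# Toy data (an assumption): mostly constant sites, some supporting cherry {1, 2}.
data = [(0, 0, 0)] * 6 + [(1, 0, 0)] * 3 + [(0, 0, 1)]
points = grid()
for outgroup in range(3):
    print(outgroup, ncm_loglik(data, outgroup, points), cm_loglik(data, outgroup, points))
```

Because the NCM supremum is taken site by site, it is never smaller than the CM value for the same tree; the grid resolution only affects the CM approximation in this small example.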

Given two models $M_1$ and $M_2$ (usually, but not necessarily the same model), we refer to ML estimation
under $M_1$ applied to $M_2$-data as the ML estimation under model $M_1$ of the discrete parameter in $A$ from
data that has been produced under model $M_2$. We are interested in determining when this method is
statistically consistent (defined shortly) for various $M_1$, $M_2$, and if so, what can be said about the sequence
length requirements for accurate estimation.

Given two models $M_1$ and $M_2$, we write $M_1 \subset M_2$ if $M_1$ is a submodel of $M_2$, that is, $M_1$ is a special case
of $M_2$ once constraints are placed on its parameters; in particular, for any model $M$, CM-$M$ $\subset$ NCM-$M$.
If $M_2$ is not contained in $M_1$, the ML estimator is often said to be carried out under a 'mis-specified model'
- in this case, we do not generally expect consistency so we are usually more interested in the regular case
where the model in which ML is performed includes the model that generates the data, that is, either
$M_1 = M_2$ or $M_2 \subset M_1$ (one exception occurs in Theorem 1(iv) which provides an instance where ML
estimation under a CM model applied to a NCM version of that model is statistically consistent).



3.3. Basic models for character evolution ($N_r$). It will be convenient to describe most of our results
for a particular model of character evolution. The simplest such model assumes that the rates of
substitution between each pair of the $r$ character states are equal - this is sometimes referred to as the
Neyman $r$-state model or the 'symmetric $r$-state model'; here we call it the $N_r$-model (after the Neyman
$r$-state model). In the special case where $r = 4$, this is the familiar Jukes-Cantor model, while $r = 2$ is
often referred to as the 'Cavender-Farris-Neyman (CFN) model'. In the $N_r$ model, it is usually (but not
always) assumed that the frequency of bases at the root of the tree is the uniform distribution.



We will also consider the limiting case of the $N_r$ model as $r$ becomes large (for a given number of species).
This model, denoted here by $N_\infty$, is sometimes called the 'Kimura-Crow infinite alleles model'
(Kimura and Crow, 1964) or the 'Random Cluster model' (Mossel and Steel, 2004), and it models the
setting where each substitution always results in a new state. We will denote the CM and NCM versions of
the $N_r$ model ($r$ being either finite or infinite) by writing CM-$N_r$ and NCM-$N_r$, respectively.



3.4. SIN for data generated under Common Mechanism (CM) models. In the
'common-mechanism' (CM) model - either for generating the data or for carrying out ML - we require the
$\theta_i$ values to all be equal to some common value (call it $\theta$). Note that even if we are not at all interested in
estimating the $\theta$ values, we often still have to consider their role in any probability calculations; in this
case, they are said to be 'nuisance parameters'.

A method $\mathcal{M}$ for estimating the discrete parameter in $A$ from a sequence of independently generated
observations is statistically consistent for data generated under a CM model if, for each $a \in A$, and
$\theta \in \Theta(a)$, the probability that $\mathcal{M}$ correctly estimates $a$ from $(u_1, \ldots, u_k)$, when each observation $u_i$ is
generated independently by the model with parameters $(a, \theta)$, converges to 1 as $k$ grows. If this condition
fails, the method $\mathcal{M}$ leads to statistical inconsistency (SIN). A related, but slightly different concept of
statistical consistency exists, based on the strong (rather than the weak) law of large numbers, but we do
not discuss it here.



Two types of SIN are possible in inferring phylogenetic tree topology. The more familiar and stronger form
is when the method $\mathcal{M}$ can 'positively mislead' - that is, the probability that the method estimates an
incorrectly resolved tree converges to 1 as the sequence length grows; this is the type of inconsistency that
occurs with maximum parsimony in the 'Felsenstein Zone' (Felsenstein, 1978).



However, a milder form of SIN occurs if, with increasing data, the method becomes unable to decide 
between the true tree and alternative trees. This can occur if the method returns a non-resolved tree, of 
which the true tree is just one resolution, and the probability of returning such a non-resolved tree from 
data generated under the CM model tends to 1 as the sequence length grows. Precisely such a situation 
has recently been shown to occur with 'Ancestral Maximum Likelihood' (AML). In a maximum-likelihood
framework this method optimizes not just the tree topology and its branch lengths but also a particular set
of ancestral sequences, and then returns just the tree topology. Mossel et al. (2009) showed that this AML
estimation of tree topology applied to CM-$N_2$ data can converge on the fully-unresolved star tree, when
the branch lengths of the fully resolved generating tree are in a certain range. Whether AML can lead to
the stronger form of SIN of being positively misleading is currently an open question.



Note that this milder form of SIN is quite different from not having sufficient data to resolve a tree 
topology (a much more familiar problem for biologists) - we deal with this latter issue in Section 5. By
contrast, mild SIN requires that the tree will never become fully resolved, no matter how much data we 
were to obtain. 



3.5. Topological aspects of statistical estimation. We now describe two conditions ('Identifiability'
and 'Kissing') that make accurate estimation of the discrete parameter $a \in A$ simultaneously possible and
problematic in the following sense: Given 'enough' data we can be sure to reconstruct $a$ correctly, but we
cannot say in advance how large 'enough' will be. These two conditions typically hold in the reconstruction
of fully-resolved phylogenetic trees as well as other related problems. To describe the conditions we need
two further definitions.

Given the model parameters $(a, \theta)$ let $p_{(a,\theta)}$ denote the associated probability distribution on site patterns,
and let $p(\Theta(a)) = \{p_{(a,\theta)} : \theta \in \Theta(a)\}$, which is a subset of the $|U|$-dimensional simplex of probability
distributions on the set $U$ of site patterns. Also, given a subset $S$ of Euclidean space, let $\overline{S}$ denote its
(topological) closure. We can now state the two conditions: For all $a, b \in A$, with $a \neq b$, consider the
following:

Identifiability condition:

$$p(\Theta(a)) \cap \overline{p(\Theta(b))} = \emptyset, \tag{2}$$

and Kissing condition:

$$\overline{p(\Theta(a))} \cap \overline{p(\Theta(b))} \neq \emptyset. \tag{3}$$

In the phylogenetic setting, where we will often regard $\Theta$ as branch lengths, $p(\Theta(a))$ will be all the
probability distributions we can obtain on site patterns by varying the branch lengths on the binary tree $a$
over all strictly positive but finite values. The set $\overline{p(\Theta(b))}$ includes not just all probability distributions
we can obtain on site patterns by varying the branch lengths on the binary tree $b$ over all strictly positive
but finite values, but also the limiting distributions as branch lengths tend to zero or to infinity (in all
possible combinations). We provide an example (and figure) to illustrate these concepts after some brief
remarks.

In general, the identifiability condition (2) alone is sufficient to ensure that maximum likelihood in the CM
setting will consistently reconstruct each discrete parameter in $A$ when the observations are generated
under a common mechanism (Chang, 1996; Steel and Székely, 2009). The condition holds for many models
in molecular systematics, including the general Markov model, a simple Covarion model, and models that
exhibit low-parameter rate variation across sites, such as the 'GTR+$\Gamma$' model. An outstanding unsolved
problem is whether the widely-used 'GTR+$\Gamma$+I' model satisfies the identifiability condition if both the
shape parameter and the proportion of invariable sites are unknown (Allman et al., 2008). Shortly we will
describe some models for which the identifiability condition has been shown to fail.



The kissing condition (3) is also relevant to phylogenetics; indeed, it applies to all models of character
evolution used for inferring tree topology. Any two different trees can produce identical data if the lengths
of the interior branches on which the two trees differ shrink to zero and/or the lengths of all (or 'most') of
the pendant edges grow to infinity ('site saturation'); these phenomena reflect the geometry of tree-space
discussed in Kim (2000) and Moulton and Steel (2004), where quite different trees can come arbitrarily
close together ('kiss') in terms of the distribution on site patterns they can produce. This means that the
sequence length required to reconstruct a tree correctly by any method tends to infinity as the interior
branches shrink in length, or as the pendant ones grow.



Note that this 'tree-space' is related to, but quite different from, the tree space described by
Billera et al. (2001) - for example, the latter tree space regards two trees of different topologies as becoming infinitely
far apart as we grow the length of all their branches; however in the tree space here, we regard them as 
becoming closer together since they are tending to produce exactly the same data (random sequences). 



This is illustrated in the following example. 



Example 3.1. We can visualise conditions (2) and (3) by means of a simple but instructive example.
Consider the three rooted binary trees on leaf taxa 1, 2, 3, which have branch lengths that satisfy a
molecular clock. Let $L$ denote the length of the interior edge, and $l$ the length of the shorter
pendant edge, so $L + l$ is the length of the longer pendant edge. For the tree $a_1 = 1|23$, the
set $\Theta(a_1)$ is the infinite open first quadrant of the plane: $\{(l, L) : L, l > 0\}$.
Figure 1. The 'paper-dart' representation of tree space for a Markov process on three taxa
subject to a molecular clock (see text for details).



Now consider the function that assigns to $(l, L)$ the probability distribution on site patterns under some
model. For simple models, such as the $N_r$ model, this function can be described as the composition of two
continuous one-to-one and onto functions. The first map associates $(l, L)$ with the vector
$(x, y) = (e^{-cl}, e^{-cL})$ where $c$ is a fixed constant (dependent on the model). The image $P_1$ of this map is the
open square $(0, 1) \times (0, 1)$ (Fig. 1(a)). The second map $\xi : P_1 \to [0, 1]^{|U|}$ sends $(x, y)$ to a probability
distribution on site patterns that is determined by the branch lengths associated with the pair $(x, y)$.

For the $N_2$ model, with a uniform probability on the two states at the root, we have eight site patterns,
$(x, y) = (e^{-2l}, e^{-2L})$ (i.e. $c = 2$), and the eight components of $\xi(x, y)$ take just three different values
according to whether (i) all three leaves are in the same state, (ii) leaf 1 is in the same state as just one of
the other two leaves, or (iii) leaf 1 is in a different state to the other two leaves. Using standard Hadamard
representation (see e.g. Semple and Steel (2003)) these three probabilities, which apply for two, four and
two site patterns, respectively, are: $\frac{1}{8}(1 + x^2 + 2x^2y^2)$, $\frac{1}{8}(1 - x^2)$, $\frac{1}{8}(1 + x^2 - 2x^2y^2)$.
Moreover, the map $\xi$ extends to $\overline{P_1}$ (the closure of $P_1$, which is the closed square as shown in Fig. 1(b)) and
$\overline{p(\Theta(a_1))} = \xi(\overline{P_1})$. Similarly, for each of the other two trees $a_2 = 2|13$ and $a_3 = 3|12$, we have:
$\overline{p(\Theta(a_i))} = \xi(\overline{P_i})$, $i = 2, 3$, where $\overline{P_2}$ and $\overline{P_3}$ are the corresponding closed squares for the other two trees
(Fig. 1(c,d)). Each point on the top boundary of $\overline{P_1}$ (corresponding to $L = 0$) has corresponding
points on the top boundary of $\overline{P_2}$ and $\overline{P_3}$ that induce exactly the same probability distribution on site
patterns, and so these three regions 'kiss' at each such point (one such shared point is indicated in each
region in Fig. 1(b,c,d)). Thus we can identify (or glue) these three squares along their top boundary (Fig.
1(e)). Finally, any point on the front boundary (corresponding to $l = \infty$) leads to the same probability
distribution on site patterns - for the $N_r$ model, this would simply assign each of the possible site patterns
equal probability. Thus the whole of this Y-shaped part of the complex in Fig. 1(e) is identified to a single
point, resulting in the final 'paper-dart' representation of the tree space shown in Fig. 1(f) (an example of
a 'CW complex' in topology).

The main point about this complex is that a one-to-one correspondence can be seen between the points on
the 'paper dart' and the probability distributions on site patterns that can be induced by 3-taxon trees
under a molecular clock where the edge lengths can vary from 0 to (actual) infinity. Note that the 'spine'
of the dart corresponds to the unresolved star tree, while the 'head' of the dart corresponds to pendant
branches of infinite length. The 'identifiability' condition (2) is satisfied since the interior of any one of the
three wings does not intersect any other wing (even at the boundary of that other wing), while the kissing
condition (3) holds since the wings all touch each other along the central spine (and also, for a different
reason, at the front tip).
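
The geometry of Example 3.1 can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the $N_2$ site-pattern probabilities of the example, indexes the leaves 1, 2, 3 as 0, 1, 2, and names each clock tree by its outgroup leaf (so outgroup 0 stands for the tree 1|23). The function names are hypothetical. It confirms that each distribution sums to 1, that different trees give different distributions at an interior point, that all three trees agree on the spine ($y = 1$, i.e. $L = 0$), and that the front tip ($x = 0$, i.e. $l = \infty$) is the uniform distribution.

```python
import math
from itertools import product


def site_pattern_distribution(outgroup, x, y):
    """Distribution on the 8 binary site patterns for the 3-taxon clock tree
    with the given outgroup leaf, where x = exp(-2l) and y = exp(-2L)."""
    j, k = [leaf for leaf in range(3) if leaf != outgroup]
    dist = {}
    for pattern in product((0, 1), repeat=3):
        s = [1 - 2 * state for state in pattern]  # map states 0/1 to +1/-1
        cherry = s[j] * s[k]
        across = s[outgroup] * (s[j] + s[k])
        dist[pattern] = (1 + cherry * x**2 + across * x**2 * y**2) / 8
    return dist


def same_distribution(p, q, tol=1e-12):
    """True if two distributions agree on every site pattern."""
    return all(abs(p[u] - q[u]) <= tol for u in p)


# Interior of the three wings: distinct trees, distinct distributions.
inner = [site_pattern_distribution(o, 0.6, 0.5) for o in range(3)]
print([math.isclose(sum(d.values()), 1.0) for d in inner])
print(same_distribution(inner[0], inner[1]))

# Spine (L = 0, so y = 1): the three wings are glued together.
spine = [site_pattern_distribution(o, 0.6, 1.0) for o in range(3)]
print(same_distribution(spine[0], spine[1]), same_distribution(spine[0], spine[2]))

# Front tip (l = infinity, so x = 0): every pattern has probability 1/8.
tip = site_pattern_distribution(0, 0.0, 0.3)
print(all(math.isclose(p, 1 / 8) for p in tip.values()))
```

A check at a single interior point does not prove identifiability, of course; it only illustrates the separation of the wings away from the spine and the tip.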



3.6. Failure of identifiability for certain CM models. Note that the identifiability condition 
generally applies to simple models of site substitution in phylogenetics when A is the set of fully-resolved 
(binary) phylogenetic trees. However, it can collapse in three important cases: The first is if we enlarge $A$
to the set of all phylogenetic trees (binary and non-binary) on a given set of leaf taxa, since if $a$ has a
polytomy, and $b$ is a tree obtained by resolving that polytomy, then:

$$p(\Theta(a)) \cap \overline{p(\Theta(b))} = p(\Theta(a)) \neq \emptyset.$$



Indeed, even under the CM model, the reconstruction of general (including non-binary) trees using
maximum likelihood will not be statistically consistent, since if the generating tree is non-binary, the ML
tree will typically be a resolution of this tree for any sequence length (though the lengths of the branches
that resolve the polytomy will converge to zero in probability as the sequence length $k$ grows (Chang,
1996)). It is possible to consistently reconstruct general (including non-binary) trees, either by modifying
ML so as to collapse edges whose length is less than, say, $\log(k)/\sqrt{k}$, or by adopting alternative approaches
to tree reconstruction (Gronau et al., 2008).
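
The edge-collapsing modification mentioned above can be sketched as follows. This is only an illustration: the representation of a tree's interior edges as a dictionary of estimated lengths, and the function names, are assumptions made here and not part of any phylogenetics package. The threshold $\log(k)/\sqrt{k}$ is the example value given in the text, and it tends to zero as the sequence length $k$ grows.

```python
import math


def collapse_threshold(k):
    """Example threshold log(k)/sqrt(k) for sequence length k."""
    return math.log(k) / math.sqrt(k)


def edges_to_collapse(interior_edge_lengths, k):
    """Return the interior edges whose estimated length falls below the threshold.

    `interior_edge_lengths` maps a (hypothetical) edge label to its ML length.
    """
    threshold = collapse_threshold(k)
    return [edge for edge, length in interior_edge_lengths.items() if length < threshold]


# Toy ML estimates (an assumption): one very short interior edge, one long one.
ml_lengths = {"e1": 0.01, "e2": 0.30}
for k in (100, 10_000, 1_000_000):
    print(k, collapse_threshold(k), edges_to_collapse(ml_lengths, k))
```

Collapsing the returned edges yields a possibly non-binary tree; how a particular tree data structure performs that contraction is left out of this sketch.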



The second situation where the identifiability condition (2) may collapse (even when we restrict $A$ to be
the set of fully-resolved trees) is when we have a phylogenetic mixture for certain models. In this case not
only can (2) fail, but the examples constructed in Matsen et al. (2008) for two-tree mixtures under the
$N_2$ model satisfy the stronger violation condition:

$$p(\Theta(a)) \cap p(\Theta(b)) \neq \emptyset. \tag{4}$$

In the setting of Matsen et al. (2008), $\Theta(a)$ refers to all triples $(\lambda, \lambda', p)$ where $\lambda$ and $\lambda'$ are assignments
of positive but finite branch lengths to tree $a$, while $p$ (respectively $1 - p$) is the probability that the site
evolves under the first (respectively second) set of branch lengths. Thus (4) describes the situation in
which two fully resolved trees of differing topology can induce exactly the same probability distribution on
site patterns under their particular mixture processes. To better visualize this, note that in our 'paper
dart' example earlier, condition (4) does not occur, but if it did, its geometric interpretation would be that
an interior portion (or point) of one 'wing' of the paper dart gets glued to a portion (or point) of a different
'wing'.



A third situation where the identifiability condition (2) may collapse is when there is a distribution of rates
across sites with too many unknown parameters. Indeed, it was shown in Steel et al. (1994) that an even
stronger violation than (4) is possible, namely all fully resolved trees on a given set of species can induce
the same probability distribution on site patterns for appropriately chosen (but positive and finite) branch
lengths in an $N_2$ model and distributions of rates across sites - in other words:

$$\bigcap_{a \in A} p(\Theta(a)) \neq \emptyset,$$

where $\Theta(a)$ is the set of branch lengths and the parameters describing the distribution of rates across sites.



In cases where the identifiability condition (2) fails (i.e. when $p(\Theta(a)) \cap \overline{p(\Theta(b))} \neq \emptyset$ for some pair $a \neq b$),
for example when a model is 'over-parameterized', we have a useful distinction based on whether the
overlap of $p(\Theta(a))$ and $\overline{p(\Theta(b))}$ has the same or smaller dimension than $p(\Theta(a))$. In the latter case,
although the model fails to satisfy the strict identifiability condition, it fails only on a subset of $\Theta(a)$ of
zero relative volume - in this case, the tree topology is said to be generically identifiable under the model.
The distinction between generic and strict identifiability is important for trying to decide whether SIN is
'theoretically possible but unlikely to occur in practice' or whether there is a reasonable chance of being in
a region of parameter space where we might be unable to distinguish between two competing trees, even
from infinite data. The distinction has often been overlooked in earlier studies, but is carefully discussed
now, particularly as generic identifiability is a notion that sits much more comfortably with current
mathematical methods for studying the properties of Markov models based on algebraic geometry and
phylogenetic invariants (Allman et al., 2008; Allman and Rhodes, 2008).



4. SIN in the No Common Mechanism (NCM) setting 

In the NCM model, the $\theta_i$ values may vary in some unknown way. In particular, we do not assume they are
selected i.i.d. from some distribution. By analogy with the CM setting, it is tempting to extend the
definition of statistical consistency of a method $\mathcal{M}$ to the NCM setting by the following slight modification:
"For each $a \in A$, and sequence $\theta_i \in \Theta(a)$, the probability that $\mathcal{M}$ correctly estimates $a$ from $(u_1, \ldots, u_k)$,
when each $u_i$ is generated independently by the model with parameters $(a, \theta_i)$, converges to 1 as $k$ grows."

However, this condition as stated is too strong: in the tree setting, if the branch lengths grew to infinity (or
shrank to zero) sufficiently fast with each observation then accurate tree reconstruction by any method for
any model can be ruled out (by the Kissing condition). Nevertheless there are meaningful notions of
statistical consistency in the NCM setting, which generalize the CM definition. Recalling that $\Theta(a)$ is an
open subset of Euclidean space, and that a compact subset of Euclidean space is any subset that is closed
and bounded, we will consider the following:

Definition: We will say that a method $M$ for reconstructing a discrete parameter in a finite set $A$ is
statistically consistent for data generated by a NCM model if it satisfies the following condition:

For each $a \in A$, and every compact subset $C$ of $\Theta(a)$, the probability that $M$ correctly estimates $a$
from $(u_1, \ldots, u_k)$, when each $u_i$ is generated independently by the model with parameters $(a, \theta_i)$,
where $\theta_i \in C$, converges to 1 as $k$ grows.

This definition is equivalent to the definition of statistical consistency under the CM version of the model if
we further insist that all the $\theta_i$ values are equal, and in this case the choice of the compact set $C$ can be
restricted to single points in $\Theta(a)$.

Note also that when we perform ML estimation under CM or NCM we do not require that the $\theta_i$ values
associated with $a$ lie within any given compact subset $C$ of $\Theta(a)$; they can take any value in $\Theta(a)$.

4.1. Which phylogenetic models and methods can lead to SIN? The following main result shows
that the issue of statistical consistency under NCM is a delicate one, depending on the details of the model
and the method. The full mathematical proof of these results is provided in the Appendix.

Theorem 1.

i. [Inconsistency of ML for the NCM-$N_r$ model] Maximum likelihood estimation of fully-resolved
tree topology under the NCM-$N_r$ model applied to NCM-$N_r$ data (or even CM-$N_r$ data) is
statistically inconsistent for any finite $r > 1$. Moreover, no tree reconstruction method is
statistically consistent for NCM-$N_2$ data.

ii. [Consistency for the NCM-$N_4$ model] In contrast to part (i), there is a statistically consistent
method for inferring fully-resolved tree topology from data generated by an NCM Jukes-Cantor
model.

iii. [Consistency for NCM models with a molecular clock] Neighbour-joining on uncorrected
sequence dissimilarity is a statistically consistent method for inferring fully-resolved tree topology
from data generated by an NCM model where each site evolves under its own General Time
Reversible (GTR) process (with its own strictly positive rate matrix and branch lengths) provided
that, at each site, the branch lengths are clock-like on the generating tree.

iv. [Consistency of ML for the NCM-$N_\infty$ model] Maximum likelihood estimation of fully-resolved
tree topology under the NCM-$N_\infty$ model (or even under the CM-$N_\infty$ model) of NCM-$N_\infty$ data is
statistically consistent.

5. Measuring SIN, and taking precautions against it 

We have provided a topological view of tree reconstruction, but there is an equivalent metric view. To 
explain this, take any continuous distance function d on probability distributions on U (the set of site 
patterns). For example, we might take the variational distance defined by $d(p, q) = \frac{1}{2}\sum_{u \in U} |p(u) - q(u)|$.
An alternative way of expressing the identifiability and kissing conditions ((2) and (3)) is to require, for all
$a, b \in A$, with $a \neq b$:

(5) $\inf_{\theta' \in \Theta(b)} d(p_{(a,\theta)}, p_{(b,\theta')}) > 0$, and

respectively, where 'inf' refers to the infimum (the minimal value, either achieved or approached in the limit). These are identical
conditions to (2) and (3), respectively, by standard arguments from analysis, based on the
Bolzano-Weierstrass Theorem.



One advantage of this metric viewpoint is that a strictly positive value in (5) not only tells us that ML is
consistent in the CM setting, but also the magnitude of this value sets explicit upper and lower bounds on
how much data (sequence length) we require in order to reconstruct the discrete parameter (tree)
accurately (Steel and Szekely 2002, 2009). The more closely a tree with appropriately chosen branch
lengths (or other parameters) can fit the probability distribution on site patterns of a different tree, the
more sequence sites it will take to tell which of the two trees produced the data.



The sequence length required for accurate tree reconstruction (under any CM model) also depends on the
number of species being classified. Quantifying this relationship is particularly challenging mathematically
(see e.g. Daskalakis et al. (2009)). Various optimal or near-optimal results have been established, which
usually require developing a new and clever tree reconstruction method (not because such methods are
necessarily better than ML estimation but rather because it has been difficult so far to rigorously establish
good bounds on the sequence length required for ML to reconstruct a large tree accurately).



Much less is known about the sequence length requirements for tree reconstruction under NCM models. In
the case of the $N_\infty$ model (Theorem 1, part (iv)), the sequence lengths required for accurate tree
reconstruction from data generated by an NCM version of the model are essentially the same as for the CM
version of the model, provided that in both models, we insist that all edge lengths are bounded between
$(r, s)$ where $0 < r < s < \infty$. However it seems entirely possible that for a finite-state $N_r$ model such as the
Jukes-Cantor model, the sequence length required for accurate tree reconstruction from NCM-$N_4$ data will
be much larger than for a CM-$N_4$ model with comparable upper and lower bounds on the branch lengths.
If so, this would be another example of where the two models (finite-state versus infinite-state) have quite
different statistical properties. Two other examples are:

• The sequence length required to resolve a short interior branch of the tree of length $\epsilon$ (from the two
alternative tree topologies obtained by swapping branches across the edge) grows at the rate $1/\epsilon^2$
for the finite-state model but just at $1/\epsilon$ for the infinite-state model, as $\epsilon \to 0$ (Mossel and Steel
2004; Steel and Szekely 2002).

• The sequence length required to determine which of two resolved trees, that classify the same $n$
species, generated the sequence data must grow under the finite-state model, but can be
independent of $n$ for the infinite-state model (Steel et al. 2009).



Returning to the NCM-$N_r$ model of Tuffley and Steel (1997), where branch lengths are allowed to vary
freely from site to site, one attempt to avoid the massive over-parameterization of this model is to assume
that these different branch lengths (between characters and across sites) are assigned randomly (i.i.d.)
according to some common fixed prior distribution. This 'Bayesian NCM' model was explored by
Huelsenbeck et al. (2008). As the authors noted, this model has an interesting property: if one has an
underlying $N_r$ model then this 'Bayesian NCM model' induces exactly the same probability distribution on
site patterns as what might be called the 'Ultra-common mechanism model' (UCM), where each character
has the same branch length, and these branch lengths are the same across the tree. This formal equivalence
between such a tightly constrained model (which would never be used in ordinary phylogenetic practice)
and a type of NCM model seems at first a little paradoxical, until it is realized that the assumption of a
common fixed prior distribution on branch lengths (across the characters and across the trees) is a very
strong 'commonality assumption'. The formal equivalence becomes only approximate for more complex
substitution models (technically, this is the result of the rate matrix having multiple nontrivial eigenvalues).



Wu et al 



( 20081) . In 



A related but less constrained version of this Bayesian NCM model was developed by 
this model, the tree has underlying branch lengths that are common across the characters, but can vary 
across the tree, and each site has an intrinsic rate which multiplies the branch lengths across the tree. But 
in contrast to O lsen's model of allowing this per-site rate to be a free parameter (to be optimized in ML) 



Wu et al 



(|2008l ) assume that this rate parameter is selected i.i.d. from a fixed distribution of rates across 



sites. For this model, when the underlying substitution process is (say) a Jukes-Cantor model, this 
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intermediate-level Bayesian NCM model assigns exactly the same probability distribution on site patterns 
as a model in which all the sites evolve i.i.d. under a Jukes-Cantor model (with no rate variation across 
sites). As with the UCM model, the formal equivalence becomes only approximate fo r more complex 
substitution models, and for the same reasons. However, in this more general setting. IWu et al 



2008) 



showed how a 'log-det' transformation gives a statistically consistent way to establish the tree topology 
from data generated under this model. 



The statistical properties of NCM have also been investigated recently from standard model-selection
approaches such as AIC (Huelsenbeck et al. 2009). It is clear that NCM can confer higher likelihood scores
than a CM model for any data, since one has so much flexibility when choosing the nuisance parameters to
get a good fit. Model selection techniques such as AIC (as well as BIC, and other variants) penalize models
that are too parameter-rich by subtracting from the log-likelihood of the model a term that depends on the
number of parameters (Akaike 1973). Under this criterion it seems unlikely that the full-blown NCM
model will ever be favored over CM models under AIC. However it is not entirely clear that the conditions
required to justify the AIC criterion extend rigorously to this NCM setting.
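
The size of the AIC penalty can be illustrated with a short Python sketch. It assumes the standard form AIC = 2p - 2 ln L, counts only branch-length parameters (2n - 3 per unrooted binary tree, repeated at every site for the full NCM model), and uses log-likelihood values that are purely hypothetical placeholders; the helper names are ours and do not come from any of the cited studies.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion, 2p - 2 ln L (smaller is better)."""
    return 2 * num_params - 2 * log_likelihood


def branch_length_params(n_taxa, n_sites, common_mechanism):
    """Count branch-length parameters only (an illustrative simplification).

    An unrooted binary tree on n taxa has 2n - 3 branches. Under CM they are
    shared by all sites; under the full NCM model every site has its own set.
    """
    per_tree = 2 * n_taxa - 3
    return per_tree if common_mechanism else per_tree * n_sites


if __name__ == "__main__":
    n_taxa, n_sites = 10, 1000
    # Hypothetical log-likelihoods, chosen only to illustrate the trade-off.
    lnl_cm, lnl_ncm = -5000.0, -3000.0
    p_cm = branch_length_params(n_taxa, n_sites, common_mechanism=True)
    p_ncm = branch_length_params(n_taxa, n_sites, common_mechanism=False)
    print("CM :", p_cm, "parameters, AIC =", aic(lnl_cm, p_cm))
    print("NCM:", p_ncm, "parameters, AIC =", aic(lnl_ncm, p_ncm))
```

With these placeholder numbers the NCM model gains 2000 log-likelihood units but pays for 17000 parameters rather than 17, so its AIC is far worse. This is only an accounting exercise; as noted above, whether the AIC criterion is rigorously justified in the NCM setting is a separate question.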



5.1. Can SIN still occur if we enforce a 'no Kissing' condition? We conclude this section by
pointing out that the Kissing condition (3) (or, equivalently, (5)) is not necessarily the cause of SIN in the
NCM setting. To see this, suppose we constrain the $N_r$ model so that all the branch lengths in a tree lie
between $\epsilon$ and $-\log(\epsilon)$ for some $\epsilon > 0$. Let us call this the $N_r^\epsilon$ model. Then, under the $N_r^\epsilon$ model we have,
for different resolved phylogenetic trees $a, b$ on the same set of species:

(7) $\inf_{\theta, \theta'} d(p_{(a,\theta)}, p_{(b,\theta')}) \geq q > 0$

where $q = q(\epsilon)$ converges monotonically to 0 as $\epsilon \to 0$. Consequently, for $\epsilon > 0$ the Kissing condition fails -
topologically different trees cannot 'look' arbitrarily close through the eyes of data produced under a CM
model. However, ML estimation under the NCM-$N_r^\epsilon$ model can again be statistically inconsistent. Indeed,
suppose we take any tree and branch lengths in the interior of a Felsenstein Zone for that tree (i.e. a set of
branch lengths where maximum parsimony would converge on an incorrect tree for data produced under
the CM-$N_r^\epsilon$ (or NCM-$N_r^\epsilon$) model). Then we can choose $\epsilon > 0$ small enough so that the ML estimate of the
tree under the NCM-$N_r^\epsilon$ model converges on the wrong tree when applied to the CM-$N_r^\epsilon$ data produced from
the original tree with its Felsenstein Zone branch lengths. The formal proof of this claim is given in the
Appendix.



6. Concluding comments 



Making molecular phylogenetic models more 'realistic', and thereby capturing more of the complexities of
how DNA evolves - across the genome and over different time scales - usually requires introducing a
number of adjustable parameters. If these parameters can be independently estimated from other data, or
if they enter into the model in ways that are not problematic for tree inference (as in Theorem 1(ii-iv)), or
if they follow some common distribution that is described by few, if any, unknown parameters, then
statistically consistent inference of tree topology is achievable. However, in general, treating branch
lengths and other model parameters as unknown quantities can drive reconstruction methods to SIN.

Theorem 1 (parts (ii-iv)) provides no real endorsement for NCM models, but it shows that sweeping
assumptions that such models must necessarily lead methods to SIN are incorrect. Such arguments
typically proceed as follows: in NCM models the number of nuisance parameters grows with $k$ and we are
unable to estimate them with any precision, thus the usual conditions that suffice for the consistency of
maximum likelihood estimation (Wald 1949) fail and so the method will be inconsistent. All but the final
conclusion of this last sentence are correct - the failure of a sufficient condition for a statement to be true
is not sufficient for it to be false! Indeed, Theorem 1 provides specific cases where NCM maximum
likelihood estimation is consistent for certain NCM models.

Even when statistically consistent methods exist for an NCM model, it is still possible that ML can be
statistically inconsistent (this contrasts with what happens in the CM setting, where ML is generally
consistent if any consistent method exists). This leads to a somewhat uncomfortable position for those who
wish to provide some statistical justification for the use of maximum parsimony as the ML estimator under
the NCM model of Tuffley and Steel (1997). By Theorem 1(i), such a method lives in a state of SIN, yet
this could be avoided for this NCM model if one were to renounce maximum parsimony in favour of a quite
different method, such as one based on linear phylogenetic invariants (the method used in the proof of
Theorem 1(ii)).



However, this is no strong argument in favour of linear invariants, as they tend to be very inefficient in
their ability to extract phylogenetic signal from data (Hillis et al. 1994b). A method based on linear
phylogenetic invariants may be guaranteed to converge on the right tree eventually, even under the NCM
model, but this may require an astronomical amount of data. By contrast, methods such as maximum
parsimony appear to be quite efficient at extracting phylogenetic signal when the generating tree branch
lengths are some way from those portions of parameter space that lead to inconsistency (Hillis, 1996).
Thus, although statistical consistency is desirable, it should not over-ride all other considerations - for
example, a powerful method that is consistent in most regions of parameter space would generally be
preferred over a statistically consistent method that may require huge amounts of data to converge.



Of course many of the results in this paper are confined to very simple models (such as the Jukes-Cantor);
we have chosen to do this for two reasons: firstly, they are sufficient to demonstrate that even with very
simple models, both outcomes (consistency and SIN) are possible given slight tweaks of the assumptions or
method; secondly, the analysis of more complex models is beyond the scope of this paper, but would be a
worthy objective for future work.



In summary, the question of whether one is prone to SIN by adopting a particular NCM model and a
particular method of inference has a more complex answer than in the CM setting - it depends subtly on
the details of the model and on the method. The full-blown generality of the NCM of Tuffley and Steel
(1997) is unnecessarily over-parameterized for most data, being a model that was developed to prove a
formal equivalence between methods rather than as a model of choice. Far from being a justification of MP,
its plethora of ever-growing parameters would surely not have seemed 'parsimonious' to William of Occam.
At the other extreme are simple attempts to include NCM within a Bayesian framework; these avoid SIN,
but at the price of forcing the NCM model into a CM straitjacket by viewing the parameters as samples
from a common underlying prior. Between these extremes there would seem to be an endless variety of
possibilities. The development of carefully constrained yet parameter-rich models, guided by model
selection criteria, and which recognize that characters evolve under different processes dependent on their
biochemistry, will surely play a significant role in future phylogenetic methodology.

8. Appendix: Technical details



The following Lemmas are required in the proof of Theorem 1. 



Lemma 8.1 (Azuma's inequality). There are several variants of this inequality (see for example
Grimmett and Stirzaker 2001); here we give a special case of a more general version. Suppose that
$X_1, \ldots, X_k$ are independent random variables taking values in some set $S$, and that $Y = f(X_1, \ldots, X_k)$
where $f : S^k \to \mathbb{R}$ is any function with the property that $|f(y_1, \ldots, y_k) - f(y'_1, \ldots, y'_k)| \leq c$ whenever $y'_i = y_i$
for all but one value of $i$. Then $P(Y - E[Y] > x)$ and $P(Y - E[Y] < -x)$ are each less than or equal to
$\exp(-x^2/(2c^2k))$.
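
As a sanity check on how Lemma 8.1 is used below, the following Python sketch compares the bound with a simulated tail probability. It assumes the simplest NCM-like setting: $k$ independent 0/1 variables, each with its own success probability, and $Y$ equal to their mean, so that changing one variable moves $Y$ by at most $c = 1/k$. The function names and the particular probabilities are illustrative choices of ours.

```python
import math
import random


def azuma_bound(x, c, k):
    """Upper bound exp(-x^2 / (2 c^2 k)) from Lemma 8.1."""
    return math.exp(-x * x / (2.0 * c * c * k))


def empirical_upper_tail(probs, x, trials, rng):
    """Estimate P(Y - E[Y] > x), where Y is the mean of independent
    Bernoulli variables with site-specific success probabilities."""
    k = len(probs)
    expected = sum(probs) / k
    hits = 0
    for _ in range(trials):
        y = sum(1 for p in probs if rng.random() < p) / k
        if y - expected > x:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    k = 200
    # A different probability at every 'site', mimicking the NCM setting.
    probs = [0.05 + 0.9 * s / (k - 1) for s in range(k)]
    rng = random.Random(1)
    x = 0.1
    print("simulated tail:", empirical_upper_tail(probs, x, 2000, rng))
    print("Azuma bound   :", azuma_bound(x, 1.0 / k, k))
```

With $c = 1/k$ the bound simplifies to $\exp(-x^2 k/2)$, which is why deviations of fixed size $x$ become exponentially unlikely as $k$ grows even though the variables are not identically distributed.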

Lemma 8.2. A method $M$ for estimating the discrete parameter $a \in A$ from sequences of observations in $U$
is statistically consistent under a NCM-M model if it satisfies the following property: for each $a \in A$, there
is a nested family $\Theta_k(a)$, $k = 1, 2, \ldots$, of increasing open subsets of Euclidean space with $\Theta(a) = \bigcup_{k \geq 1} \Theta_k(a)$,
so that the following condition holds: the probability that $M$ correctly estimates element $a$ from
$(u_1, \ldots, u_k)$ whenever each $u_i$ is generated by $(a, \theta_i)$, with $\theta_i \in \Theta_k(a)$, converges to 1 as $k$ grows.

Proof. Suppose $C$ is a compact subset of $\Theta(a)$. Then $C \cap \Theta_k(a)$, $k \geq 1$, is an open cover of $C$. Since $C$ is
compact, $C$ is equal to the union of finitely many of the sets $C \cap \Theta_k(a)$, and since the sets $\Theta_k(a)$, $k \geq 1$, are
nested, a value of $k = k_1$ exists for which $C \subseteq \Theta_{k_1}(a)$. By the hypothesis of the Lemma, the event that $M$
correctly returns $a$ from $(u_1, \ldots, u_k)$ when each $u_i$ is generated by $(a, \theta_i)$, where $\theta_i \in \Theta_k(a)$, has
probability that converges to 1 as $k$ grows. Since $C \subseteq \Theta_k(a)$ for $k \geq k_1$, restricting $\theta_i$ to lie in $C$
ensures that the probability $M$ correctly returns $a$ from $(u_1, \ldots, u_k)$ when each $u_i$ is generated by $(a, \theta_i)$,
where $\theta_i \in C$, also converges to 1 as $k$ grows.

□



8.1. Proof of Theorem 1. Part (i): ML estimation under the NCM-$N_r$ model applied to any sequence
of $r$-state characters returns the same tree(s) as maximum parsimony (Tuffley and Steel 1997). This latter
method was shown to be statistically inconsistent for CM-$N_2$ data (Felsenstein 1978) and, more generally,
for CM-$N_r$ data for $r > 2$ by later authors (see Schulmeister (2004), and the references therein) even for
trees on four species. Since the CM-$N_r$ model is just a submodel of the NCM-$N_r$ model, both assertions in
the first claim of Part (i) follow. Specifically, we can take $a$ to be any resolved binary tree on four leaves,
and $\Theta(a)$ to be $(0, \infty)^5$, and select the $\theta_i$ values all to be equal to a choice of branch lengths $\theta \in (0, \infty)^5$ for
which maximum parsimony (and thereby ML under NCM-$N_r$) converges on an incorrect tree.
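
The existence of such branch lengths can be checked numerically for the two-state case. The Python sketch below is an illustration under assumptions of our own: a symmetric two-state model on the quartet 12|34, parameterized by per-edge change probabilities rather than branch lengths, with arbitrarily chosen values (0.4 for the two long pendant edges, 0.05 elsewhere). It computes the limiting frequencies of the three parsimony-informative pattern classes; with infinite data, parsimony on a quartet returns the topology whose supporting class is most frequent.

```python
from itertools import product


def pattern_prob(pattern, p):
    """Probability of a binary site pattern (s1, s2, s3, s4) on the tree
    12|34 under a symmetric two-state model. p = (p1, p2, p3, p4, p5) are
    change probabilities on the four pendant edges and the interior edge."""
    s1, s2, s3, s4 = pattern
    p1, p2, p3, p4, p5 = p

    def t(x, y, pe):
        # Probability of seeing states x and y at the two ends of an edge.
        return pe if x != y else 1.0 - pe

    total = 0.0
    # Sum over the states of the two interior nodes (uniform root state).
    for sa, sb in product((0, 1), repeat=2):
        total += (0.5 * t(sa, s1, p1) * t(sa, s2, p2) * t(sa, sb, p5)
                  * t(sb, s3, p3) * t(sb, s4, p4))
    return total


def informative_support(p):
    """Limiting frequency of the pattern class supporting each topology."""
    return {
        "12|34": 2 * pattern_prob((0, 0, 1, 1), p),
        "13|24": 2 * pattern_prob((0, 1, 0, 1), p),
        "14|23": 2 * pattern_prob((0, 1, 1, 0), p),
    }


def parsimony_limit_tree(p):
    """Topology that maximum parsimony converges on as k grows."""
    support = informative_support(p)
    return max(support, key=support.get)


if __name__ == "__main__":
    felsenstein_zone = (0.4, 0.05, 0.4, 0.05, 0.05)  # leaves 1 and 3 long
    balanced = (0.1, 0.1, 0.1, 0.1, 0.1)
    print(parsimony_limit_tree(felsenstein_zone))
    print(parsimony_limit_tree(balanced))
```

For the first parameter set the class supporting 13|24 is more frequent than the one supporting the generating tree 12|34, so parsimony (and hence ML under NCM-$N_2$) converges on the wrong topology; for the balanced set the generating tree is recovered.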



The proof of the second claim, that concerning the NCM-$N_2$ model, follows directly from the examples in
Matsen and Steel (2007).

For parts (ii-iv) we will establish the statistical consistency of various methods by establishing the
property described in Lemma 8.2.



Part (ii): The proof relies on the existence of certain linear phylogenetic invariants for the Jukes-Cantor
model (the existence of such invariants for models that include the Jukes-Cantor was described by Lake
(1987)). In particular, from Theorem 1 (part 5) of Steel and Fu (1995), any binary phylogenetic tree $T$ has
an associated function $L_T$ of the site pattern frequencies, such that (i) $L_T(p) = 0$ where $p$ is the
probability vector of site patterns generated by $T$ under any assignment of branch lengths, and (ii) for any
binary phylogenetic tree $T'$ that is different from $T$, but has the same leaf set, we have
$L_{T'}(p) \geq f(u, v) > 0$ where $u$ is the shortest branch length, $v$ is the largest branch length, and $f$ is a
continuous function that has the following two properties:

• for all $u > 0$, $f$ is monotone decreasing in $v$ and converges to 0 as $v$ tends to infinity;

• for all $v > 0$, $f$ is monotone increasing in $u$ and converges to 0 as $u$ tends to 0.



Although these are all the properties of $f$ we require for the rest of the proof, an explicit
description of $f$ is as follows:

For a binary tree $T$ on four leaves (a quartet tree) with topology $ij|kl$, a linear invariant $L_T$ for the
Jukes-Cantor model was described in Steel and Fu (1995) with the properties described in the proof of Part
(ii) of Theorem 1 with: $f(u, v) = \exp(-\frac{4}{3}S) \cdot (1 - \exp(-\frac{4}{3}I))$ where $S$ is the sum of the lengths of the four
pendant edges, and $I$ is the length of the interior edge. For a binary tree $T$ with more than four leaves,
select any collection $Q$ of quartet trees that define $T$ (i.e. for which $T$ is the unique tree that displays those
quartet trees) and take the sum of the linear invariants just described to give a linear invariant $L_T$. Notice
that $L_T$ also satisfies the condition described in the proof of Part (ii) of Theorem 1 by taking $f(u, v)$ to be
the function: $\exp(-\frac{4}{3}(2n - 4)v) \cdot (1 - \exp(-\frac{4}{3}u))$ where $n$ is the number of species (so the number of edges
in the pendant edges of any induced quartet tree is at most $2n - 4$). This is the function $f$ promised.

Returning to the proof of Part (ii), for a site $s$, let $X_s$ be the $4^n$-dimensional vector, indexed by site
patterns, that takes the value 1 for the site pattern observed at site $s$ and 0 otherwise, and let
$x^{(k)} = \frac{1}{k}\sum_{s=1}^{k} X_s$. Consider the following tree reconstruction method
$M$: Select the binary phylogenetic tree $T$ that minimizes $L_T(x^{(k)})$.

We will show that $M$ is statistically consistent under a NCM-$N_4$ model, by ensuring that the branch
lengths in the generating tree at site $i$ lie between $\epsilon_i$ and $L_i$, where these two sequences converge
monotonically to zero and to infinity (respectively), sufficiently slowly with $i$.

To this end, suppose $k$ sites evolve on a fully-resolved tree $T$ under a NCM-$N_4$ model. Let $p_s = E[X_s]$, the
vector of probabilities of the different site patterns at site $s$, and let $p^{(k)} := \frac{1}{k}\sum_{s=1}^{k} p_s$. By the invariant
property of $L_T$, we have $L_T(p_s) = 0$ for all $s$, and since $L_T$ is linear, it follows that:

(8) $E[L_T(x^{(k)})] = L_T(E[x^{(k)}]) = L_T(p^{(k)}) = 0$ for all $k \geq 1$.

Similarly, for any fully-resolved phylogenetic tree $T'$ that is different from $T$, but has the same leaf set, we
have:

(9) $E[L_{T'}(x^{(k)})] = L_{T'}(p^{(k)}) \geq f(\epsilon_k, L_k)$.

From the continuity of $f$ and its other listed properties, we can allow $\epsilon_k$ to tend to zero and $L_k$ to tend to
infinity sufficiently slowly (with increasing $k$) that the following condition is satisfied:

(10) $\lim_{k \to \infty} k \cdot f^2(\epsilon_k, L_k) = \infty$.

Now, since the $X_s$, $s = 1, \ldots, k$, are independent random variables, the Azuma inequality combined with
(8) and (10) gives: $\lim_{k \to \infty} P(L_T(x^{(k)}) > \frac{1}{2} f(\epsilon_k, L_k)) = 0$; while for any alternative fully-resolved
phylogenetic tree $T'$ ($\neq T$), Eqns. (9) and (10) give: $\lim_{k \to \infty} P(L_{T'}(x^{(k)}) < \frac{1}{2} f(\epsilon_k, L_k)) = 0$. Combining
these two last equations gives: $\lim_{k \to \infty} P(L_T(x^{(k)}) < L_{T'}(x^{(k)})) = 1$, and so

$\lim_{k \to \infty} P(L_T(x^{(k)}) < L_{T'}(x^{(k)}) \text{ for all } T' \neq T) = 1$.

By Lemma 8.2 this implies that method $M$ is statistically consistent under the model described.

Part (iii): Let $d^{(k)}_{ij}$ denote the proportion of the $k$ sites on which species $i, j$ disagree and let $\mu^{(k)}_{ij} = E[d^{(k)}_{ij}]$.
Thus $d^{(k)}_{ij} = \frac{1}{k}\sum_{s=1}^{k} \zeta^s_{ij}$ where $\zeta^s_{ij}$ takes the value 1 if sequences $i$ and $j$ differ at site $s$, and 0 otherwise. By
the standard theory of reversible $r$-state Markov processes, combined with the molecular clock hypothesis,
for any two species $x, y$, we can write

(11) $E[\zeta^s_{xy}] = \Bigl(1 - \sum_{i=1}^{r} \pi_{s,i}^2\Bigr) - \sum_{j=1}^{r-1} a_{s,j} e^{-2\rho_{s,j} t_{xy}}$

where:

• $\pi_{s,i}$ is the equilibrium frequency of base $i$ at site $s$;

• $-\rho_{s,j}$ are the non-zero eigenvalues of the GTR rate matrix at site $s$;

• the coefficients $a_{s,j}$ are positive (and determined by the eigenvalues of the GTR matrix at site $s$,
along with the $\pi_{s,i}$ values);

• $t_{xy}$ is the time from when species $x$ and $y$ diverged in the tree to the present.

Consequently, if the generating tree $T$ resolves the triplet of species $i, j, l$ as the rooted tree $ij|l$ then:

(12) $E[\zeta^s_{il}] = E[\zeta^s_{jl}] > E[\zeta^s_{ij}]$, and $E[\zeta^s_{il}] - E[\zeta^s_{ij}] \geq g(u_s, v_s)$

where $g$ is a continuous function that has the same properties as $f$ in the previous proof and where:

• $u_s$ is the sum of the branch lengths on the path between the least common ancestor of $i, l$ and the
least common ancestor of $i, j$ at site $s$ times the substitution rate at site $s$;

• $v_s$ is the sum of the branch lengths between the root and any leaf multiplied by the largest
magnitude of any eigenvalue of the GTR matrix at site $s$.

An explicit description of the function $g$ is as follows: From (11) we have
$E[\zeta^s_{il}] - E[\zeta^s_{ij}] = \sum_{j=1}^{r-1} a_{s,j} \bigl(\exp(-2\rho_{s,j} t_{ij}) - \exp(-2\rho_{s,j} t_{il})\bigr)$, and, using the identity
$e^{-x} - e^{-y} = e^{-y}(e^{y-x} - 1) \geq e^{-y}(y - x)$ for $0 < x < y$, we have:

$E[\zeta^s_{il}] - E[\zeta^s_{ij}] \geq \sum_{j=1}^{r-1} 2 a_{s,j} \rho_{s,j} (t_{il} - t_{ij}) e^{-2\rho_{s,j} t_{il}}$.

Now the term $\sum_{j} a_{s,j}\rho_{s,j}$ is the substitution rate at site $s$, and so we can set $g(u, v) = 2u e^{-2v}$.

Thus, if we let $\mu^{(k)}_{ij} = E[d^{(k)}_{ij}]$ then

(13) $\mu^{(k)}_{il} = \mu^{(k)}_{jl}$ and $\mu^{(k)}_{il} - \mu^{(k)}_{ij} \geq g(\epsilon_k, L_k)$,
where $\epsilon_k = \min\{u_s : 1 \leq s \leq k\}$ and $L_k = \max\{v_s : 1 \leq s \leq k\}$.

As in the previous proof, by the continuity of $g$ and its other listed properties, we can allow $\epsilon_k$ to tend to
zero and $L_k$ to tend to infinity sufficiently slowly (with increasing $k$) that $\lim_{k \to \infty} k \cdot g^2(\epsilon_k, L_k) = \infty$. Then
by Azuma's inequality:

(14) $\lim_{k \to \infty} P\Bigl(\max_{ij} |d^{(k)}_{ij} - \mu^{(k)}_{ij}| > \frac{1}{2} g(\epsilon_k, L_k)\Bigr) = 0$.

Note that, by (13), the $\mu$ values are additive on $T$ and each interior edge has a branch length of at least
$g(\epsilon_k, L_k)$. We can thus invoke the 'safety radius' result of Atteson (1999) which guarantees that
neighbor-joining applied to the matrix of $d^{(k)}_{ij}$ values, for all pairs $i, j$, will return $T$ provided that each
pairwise distance differs from $\mu^{(k)}_{ij}$ by at most $\frac{1}{2} g(\epsilon_k, L_k)$. This last event has probability converging to
1 as $k$ grows by (14) and so Part (iii) now follows from Lemma 8.2.
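
The key step, that averaging uncorrected dissimilarities over sites preserves the clock-like ordering in (13), can be illustrated numerically. The Python sketch below specializes to a Jukes-Cantor process at each site (for which the expected mismatch probability is $\frac{3}{4}(1 - e^{-\frac{4}{3}\cdot 2 \lambda t})$ for rate $\lambda$ and divergence time $t$), with site-specific rates and divergence times that are arbitrary values of our own choosing.

```python
import math


def jc_expected_mismatch(rate, t_div):
    """Expected mismatch probability between two species under Jukes-Cantor
    at one site: the lineages are separated by total time 2 * t_div."""
    return 0.75 * (1.0 - math.exp(-(4.0 / 3.0) * rate * 2.0 * t_div))


def mean_dissimilarity(rates, t_div):
    """Expected uncorrected dissimilarity averaged over sites, each of
    which has its own substitution rate (an NCM-style assumption)."""
    return sum(jc_expected_mismatch(r, t_div) for r in rates) / len(rates)


def is_ultrametric_triple(d_ij, d_il, d_jl, tol=1e-12):
    """Three-point condition: the two largest distances are equal."""
    a, b, c = sorted((d_ij, d_il, d_jl))
    return abs(c - b) <= tol and a <= b


if __name__ == "__main__":
    rates = [0.2, 1.0, 3.5, 0.7, 9.0]  # illustrative per-site rates
    t_ij, t_il, t_jl = 0.1, 0.3, 0.3   # rooted triple ij|l under a clock
    mu_ij = mean_dissimilarity(rates, t_ij)
    mu_il = mean_dissimilarity(rates, t_il)
    mu_jl = mean_dissimilarity(rates, t_jl)
    print(mu_ij, mu_il, mu_jl)
    print(is_ultrametric_triple(mu_ij, mu_il, mu_jl))
```

Because the mismatch probability at every site is an increasing function of divergence time alone, the averaged values satisfy $\mu_{il} = \mu_{jl} > \mu_{ij}$ whatever the per-site rates are, which is what allows neighbor-joining to work on uncorrected distances here without any model-based correction.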

Part (iv): For any data consisting of a sequence of characters on a set of species, the only phylogenetic
trees that have positive likelihood under the NCM-$N_\infty$ model are those on which the data are
homoplasy-free (i.e. require no reverse or convergent evolutionary events). Thus, it suffices to show that for
$k$ characters generated by a NCM-$N_\infty$ model on $T$, the probability that $T$ is the only phylogenetic tree for
the given species on which these characters are homoplasy-free converges to 1 as $k \to \infty$. Following
Warnow et al. (2006), it suffices to show that the following event $E_k$ has probability converging to 1 as $k$
grows: $E_k$ is the event that for each induced quartet tree $ab|cd = T|\{a, b, c, d\}$ of $T$, at least one of the $k$
characters assigns the same state to $a$ and $b$, and the same state to $c$ and $d$, with these two states
being different. By the independence assumption between changes on different edges in the $N_\infty$ model, and
by the Bonferroni inequality, we have:



(15) F(E k )>l-( )-H(l- Psq i) 



k 

n > 



8 = 1 



where $p_s$ (respectively $q_s$) is the smallest substitution probability on an edge (respectively the largest
substitution probability on a path) for the process that generates site $s$. Thus, provided that the branch
lengths at site $s$ are bounded between $(\epsilon_k, L_k)$ where $\epsilon_k$ converges to 0 sufficiently slowly, and that $L_k$
converges to infinity sufficiently slowly (with increasing $k$), then $\lim_{k \to \infty} P(E_k) = 1$, by (15). Part (iv) now
follows from Lemma 8.2.



8.2. Proof that ML under an $\epsilon$-constrained NCM can be statistically inconsistent. For
$u = (u_1, \ldots, u_k) \in U^k$ and a fully resolved tree $a$, let $L_a(u)$ be the log of the maximum likelihood value of
the data $u$ under the NCM-$N_r$ model. From Tuffley and Steel (1997) we have:

(16) $L_a(u) = -(l(u, a) + k) \cdot \log(r)$,

where $l(u, a)$ is the parsimony score of $u$ on $a$. Similarly, for $\epsilon > 0$, let $L^\epsilon_a(u)$ be the log of the maximum
likelihood value of the data $u$ under the constrained NCM-$N_r$ model on tree $a$ in which each branch length
is required to lie between $\epsilon$ and $-\log(\epsilon)$. Clearly, $L^\epsilon_a(u) \leq L_a(u)$.
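
Equation (16) reduces the unconstrained NCM-$N_r$ log-likelihood to a parsimony computation, which the following Python sketch makes concrete. It uses the standard Fitch algorithm for the parsimony score; the nested-tuple tree encoding, the leaf names and the two example characters are illustrative assumptions of ours.

```python
import math


def fitch_score(tree, states):
    """Fitch parsimony score of one character. `tree` is a rooted binary
    tree written as nested 2-tuples of leaf names; `states` maps each leaf
    name to its observed state. The score does not depend on where the
    unrooted tree is rooted."""
    def visit(node):
        if not isinstance(node, tuple):
            return {states[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def ncm_max_log_likelihood(tree, sites, r):
    """Equation (16): L_a(u) = -(l(u, a) + k) * log(r)."""
    k = len(sites)
    total_score = sum(fitch_score(tree, site) for site in sites)
    return -(total_score + k) * math.log(r)


if __name__ == "__main__":
    tree_ab_cd = (("a", "b"), ("c", "d"))
    tree_ac_bd = (("a", "c"), ("b", "d"))
    sites = [
        {"a": "A", "b": "A", "c": "G", "d": "G"},  # supports ab|cd
        {"a": "T", "b": "T", "c": "T", "d": "T"},  # constant site
    ]
    print(ncm_max_log_likelihood(tree_ab_cd, sites, r=4))
    print(ncm_max_log_likelihood(tree_ac_bd, sites, r=4))
```

Since $\log(r) > 0$, the tree with the smaller total parsimony score always has the larger value of $L_a(u)$, which is the sense in which ML under this model and maximum parsimony select the same tree(s).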

Consider the following way to 'prune' branch lengths in any tree $c$, which associates to each vector of
branch lengths $\theta$ a corresponding set of branch lengths $\theta^\epsilon$ that satisfy the $\epsilon$ constraint: For each branch
length shorter than $\epsilon$, reset that branch length to $\epsilon$, and for each branch length larger than $-\log(\epsilon)$, reset it
to $-\log(\epsilon)$. This transformation $\theta \mapsto \theta^\epsilon$ enjoys the following property for the $N_r$ model: For any site
pattern $u \in U$ we have:

$P(u|c, \theta^\epsilon) \geq P(u|c, \theta) - O(\epsilon)$,

where $P(u|c, \theta^\epsilon)$ is the probability of generating $u$ on tree $c$ with branch lengths $\theta^\epsilon$ and where $O(\epsilon)$ is a
term that depends just on $\epsilon$ and the number of leaves in the tree, and which tends to zero as $\epsilon \to 0$. It
follows that:

(17) $\frac{1}{k} L^\epsilon_c(u) \geq \frac{1}{k} L_c(u) - O(\epsilon)$.

Now, by elementary algebra:

(18) $\frac{1}{k}(L^\epsilon_a(u) - L^\epsilon_b(u)) = \Delta_1 + \Delta_2 + \Delta_3$

where:

$\Delta_1 = \frac{1}{k}(L^\epsilon_a(u) - L_a(u)) \leq 0$,

$\Delta_2 = \frac{1}{k}(L_a(u) - L_b(u))$, and $\Delta_3 = \frac{1}{k}(L_b(u) - L^\epsilon_b(u))$.
Now, for any two trees $a, b$, Eqn. (16) gives:

(19) $\Delta_2 = \Bigl(\frac{l(u, b)}{k} - \frac{l(u, a)}{k}\Bigr) \cdot \log(r)$.

If $u$ is generated by a CM-$N_r$ model on $a$ with branch lengths in the interior of the Felsenstein Zone (a region of
branch lengths for tree $a$ where maximum parsimony converges on an incorrect tree) then, for tree $b$ having
a different topology from $a$, the term $\frac{l(u, b)}{k} - \frac{l(u, a)}{k}$ in (19) converges in probability to a negative constant
$-C$ (the actual value of which is dependent on the branch lengths used in the Felsenstein Zone setting).

Regarding $\Delta_3$, we can apply Inequality (17) for $c = b$ and select $\epsilon$ sufficiently small (but strictly positive) so
that (i) the branch lengths used in the Felsenstein Zone setting are all greater than $\epsilon$ and less than $-\log(\epsilon)$
and (ii) the $O(\epsilon)$ term in (17) is less than $\frac{1}{2} C \log(r)$ and so, for all $k \geq 1$ and all $u \in U^k$:

(20) $\frac{1}{k}(L_b(u) - L^\epsilon_b(u)) < \frac{1}{2} C \log(r)$.
Thus, from (18) and (20) we have:

(21) $\frac{1}{k}(L^\epsilon_a(u) - L^\epsilon_b(u)) < \Delta_2 + \frac{1}{2} C \log(r)$,
for $\epsilon > 0$ sufficiently small.

Since $\Delta_2$ converges in probability to $-C \log(r)$ with increasing $k$, it follows from (21) that the probability
that $\frac{1}{k}(L^\epsilon_a(u) - L^\epsilon_b(u))$ is negative when $(u_1, \ldots, u_k)$ is generated under the CM-$N_r$ model tends to 1 as $k$
grows. That is, ML estimation under an NCM-$N^\epsilon_r$ model is statistically inconsistent, for data generated
under the CM-$N^\epsilon_r$ model (or a NCM-$N^\epsilon_r$ model) when the branch lengths lie within the Felsenstein Zone
for tree $a$, and $\epsilon$ is chosen sufficiently small (relative to those branch lengths).



